# count items on column
domains_list = df['domains'].value_counts()
# return first n rows in descending order
top_domains = domains_list.nlargest(20)
top_domains
Top 20 most used hashtags and their frequency
# convert dataframe column to list
hashtags = df['hashtags'].to_list()
# remove nan items from list
hashtags = [x for x in hashtags if not pd.isna(x)]
# split items into a list based on a delimiter
hashtags = [x.split('|') for x in hashtags]
# flatten list of lists
hashtags = [item for sublist in hashtags for item in sublist]
# count items on list
hashtags_count = pd.Series(hashtags).value_counts()
# return first n rows in descending order
top_hashtags = hashtags_count.nlargest(20)
top_hashtags
# filter column from dataframe
users = df['mentioned_names'].to_list()
# remove nan items from list
users = [x for x in users if not pd.isna(x)]
# split items into a list based on a delimiter
users = [x.split('|') for x in users]
# flatten list of lists
users = [item for sublist in users for item in sublist]
# count items on list
users_count = pd.Series(users).value_counts()
# return first n rows in descending order
top_users = users_count.nlargest(20)
top_users
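The hashtag and mention cells above repeat the same split, flatten, and count steps. A minimal sketch of one reusable helper, assuming pipe-delimited string columns as in those cells; the function name `top_split_values` and the toy data are hypothetical and not part of the original notebook:

```python
import pandas as pd


def top_split_values(df, column, n=20, sep="|"):
    """Count the items of a delimiter-separated column and return the top n."""
    values = (
        df[column]
        .dropna()        # skip rows without a value
        .str.split(sep)  # "a|b" -> ["a", "b"] (single-character separator)
        .explode()       # one row per item
    )
    return values.value_counts().nlargest(n)


# toy data for illustration only
toy = pd.DataFrame({"hashtags": ["a|b", None, "a", "b|c|a"]})
top_toy = top_split_values(toy, "hashtags", n=2)
print(top_toy)
```

With the notebook's dataframe, the same helper could be called as `top_split_values(df, 'hashtags')` or `top_split_values(df, 'mentioned_names')`.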
# plot the data using plotly
fig = px.line(df, x='date', y='like_count', title='Likes over Time',
              template='plotly_white', hover_data=['text'])
# show the plot
fig.show()
Tokens
Top 20 most common tokens and their frequency
# load the spacy model for Portuguese
nlp = spacy.load("pt_core_news_sm")
# load stop words for Portuguese
STOP_WORDS = nlp.Defaults.stop_words

# function to filter stop words
def filter_stopwords(text):
    # lowercase the text and tokenize it
    doc = nlp(text.lower())
    # keep alphabetic tokens that are not stop words
    tokens = [token.text for token in doc
              if not token.is_stop and token.text not in STOP_WORDS and token.is_alpha]
    return ' '.join(tokens)

# apply function to dataframe column
df['text_pre'] = df['text'].apply(filter_stopwords)
# count items on column
token_counts = df["text_pre"].str.split(expand=True).stack().value_counts()[:20]
token_counts
assista 7856
q 5381
vídeo 4499
deus 3435
programa 3030
dia 2894
bolsonaro 2840
vitória 2782
cristo 2620
brasil 2418
acesse 2256
hoje 2189
vou 2168
pt 2073
divulgue 1981
ñ 1882
sábado 1816
imprensa 1727
lula 1721
imperdível 1715
Name: count, dtype: int64
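The counts above come from joining the filtered tokens back into strings and then splitting them again with pandas. A lighter alternative is to count tokens directly. The sketch below uses only the standard library, with a regular expression standing in for spaCy's tokenizer and a hypothetical three-word stop list, so its counts would not match the spaCy output exactly:

```python
import re
from collections import Counter

# hypothetical stop-word set; the notebook takes its list from spaCy
TOY_STOP_WORDS = {"o", "a", "de"}

# runs of Unicode letters, roughly mirroring the is_alpha filter
WORD_RE = re.compile(r"[^\W\d_]+")


def count_tokens(texts, stop_words):
    """Count lowercase alphabetic tokens that are not stop words."""
    counts = Counter()
    for text in texts:
        tokens = WORD_RE.findall(text.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts


# toy texts for illustration only
toy_texts = ["Assista o vídeo de hoje", "Assista a o programa 2"]
toy_token_counts = count_tokens(toy_texts, TOY_STOP_WORDS)
print(toy_token_counts.most_common(3))
```

`Counter.most_common(20)` would play the role of `value_counts()[:20]` in the cell above.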
Hour
Top 10 hours with the most tweets published
# extract hour from datetime column
df['hour'] = df['date'].dt.strftime('%H')
# count items on column
hours_count = df['hour'].value_counts()
# return first n rows in descending order
top_hours = hours_count.nlargest(10)
top_hours
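The cell above stores the hour as a zero-padded string and keeps only the ten busiest hours. When the full daily profile is of interest, a possible variant is to keep the hour as an integer and list all 24 hours in clock order. This is a sketch on toy timestamps, assuming `date` is already a datetime column as in the notebook:

```python
import pandas as pd

# toy timestamps for illustration only
toy = pd.DataFrame({"date": pd.to_datetime([
    "2020-07-01 09:15:00",
    "2020-07-01 09:45:00",
    "2020-07-02 21:05:00",
])})

# integer hour of day (0-23) rather than a zero-padded string
toy["hour"] = toy["date"].dt.hour

# counts for every hour in clock order; hours without tweets get 0
hours_by_clock = toy["hour"].value_counts().reindex(range(24), fill_value=0)
print(hours_by_clock[hours_by_clock > 0])
```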
Platforms from which content was published and their frequency
df['source_name'].value_counts()
source_name
Twitter Web Client 11191
Postcron App 10752
Twitter for iPad 8900
mLabs - Gestão de Redes Sociais 7243
Twitter for iPhone 2183
erased3412752 723
Twitter Ads 580
Twitter for Android 466
Twitter for Android Tablets 444
TweetDeck 424
Twitter Web App 303
Postgrain 144
Periscope 106
Twitter for BlackBerry® 98
Twitter for Advertisers. 65
Dynamic Tweets 63
Twitpic 7
Twitter for Websites 3
Mobile Web 3
iOS 3
Mobile Web (M2) 2
Instagram 2
Photos on iOS 1
Twitter Media Studio 1
audioBoom 1
Twitter for Windows Phone 1
Name: count, dtype: int64
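The table above has a long tail of sources with only a handful of tweets. For plotting or reporting shares, one option is to fold rare sources into a single row. A minimal sketch on toy counts; the helper name `lump_rare`, the threshold, and the numbers are hypothetical:

```python
import pandas as pd


def lump_rare(counts, min_count):
    """Collapse categories below min_count into a single 'Other' row."""
    keep = counts[counts >= min_count]
    other = counts[counts < min_count].sum()
    if other > 0:
        keep = pd.concat([keep, pd.Series({"Other": other})])
    return keep


# toy counts for illustration only
toy_sources = pd.Series({"Web": 50, "iPad": 40, "Twitpic": 7, "iOS": 3})
lumped = lump_rare(toy_sources, min_count=10)

# percentage share of each remaining row
shares = (100 * lumped / lumped.sum()).round(1)
print(shares)
```

With the notebook's data, the input would be `df['source_name'].value_counts()`.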
Topics
Topic modeling technique with transformers and TF-IDF
# visualize topics
topic_model.visualize_topics()
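The section heading mentions TF-IDF. The method names used here (`visualize_topics`, `visualize_barchart`, `get_topic`) resemble BERTopic's API, although the notebook does not name the library in this section. Assuming a class-based TF-IDF of the kind BERTopic documents, where a term's weight in a topic is its within-topic frequency times log(1 + A / f_t), with A the average number of words per topic and f_t the term's frequency across all topics, a minimal standard-library sketch on toy tokens might look like this:

```python
import math
from collections import Counter


def class_tfidf(tokens_by_topic):
    """Return {topic: {term: weight}} using tf * log(1 + A / f_t)."""
    term_counts = {t: Counter(tokens) for t, tokens in tokens_by_topic.items()}
    # A: average number of words per topic
    avg_words = sum(len(v) for v in tokens_by_topic.values()) / len(tokens_by_topic)
    # f_t: frequency of each term across all topics
    total = Counter()
    for counts in term_counts.values():
        total.update(counts)
    weights = {}
    for topic, counts in term_counts.items():
        n_words = sum(counts.values())
        weights[topic] = {
            term: (k / n_words) * math.log(1 + avg_words / total[term])
            for term, k in counts.items()
        }
    return weights


# toy topics for illustration only
toy_topics = {
    0: ["deus", "cristo", "cristo", "brasil"],
    1: ["lula", "pt", "pt", "brasil"],
}
toy_weights = class_tfidf(toy_topics)
top_term_0 = max(toy_weights[0], key=toy_weights[0].get)
print(top_term_0)
```

In the toy data, "brasil" appears in both topics and so receives the lowest weight in each, which is the behavior that makes topic keywords distinctive.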
Topic reduction
Map with 10 topics from the tweet content
# visualize topics
topic_model.visualize_topics()
Terms per topic
topic_model.visualize_barchart(top_n_topics=11)
Topic analysis
Selection of topics that touch on gender issues
# # selection of topics
# topics = [0]
# keywords_list = []
# for topic_ in topics:
#     topic = topic_model.get_topic(topic_)
#     keywords = [x[0] for x in topic]
#     keywords_list.append(keywords)
# # flatten list of lists
# word_list = [item for sublist in keywords_list for item in sublist]
# # use apply method with lambda function to filter rows
# filtered_df = df[df['text_pre'].apply(lambda x: any(word in x for word in word_list))]
# percentage = round(100 * len(filtered_df) / len(df), 2)
# print(f"Of the {len(df)} tweets by @PastorMalafaia, about {len(filtered_df)} talk about gender issues, that is, roughly {percentage}%")
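The commented-out filter above tests `word in x` on the preprocessed string, which is a substring match: a short keyword also matches inside longer words and can inflate the percentage. A minimal sketch of whole-token matching, assuming `text_pre` holds space-separated tokens as produced earlier; the function name and toy strings are hypothetical:

```python
def mentions_any(text, keywords):
    """True if any whole space-separated token of text is in keywords."""
    return not set(text.split()).isdisjoint(keywords)


# toy example for illustration only
toy_keywords = {"pt"}
toy_text = "script aberto"  # contains "pt" only inside "script"

substring_hit = any(word in toy_text for word in toy_keywords)
token_hit = mentions_any(toy_text, toy_keywords)
print(substring_hit, token_hit)
```

In the notebook, this could replace the lambda with `df['text_pre'].apply(lambda x: mentions_any(x, set(word_list)))`.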
# # keep only rows where both like_count and retweet_count are non-zero
# filtered_df = filtered_df[(filtered_df.like_count != 0) & (filtered_df.retweet_count != 0)]
# # add a new column with the mean of likes and retweets
# filtered_df['impressions'] = (filtered_df['like_count'] + filtered_df['retweet_count'])/2
# # extract year from datetime column
# filtered_df['year'] = filtered_df['date'].dt.year
# # remove urls (only the URL option is set)
# p.set_options(p.OPT.URL)
# filtered_df['tweet_text'] = filtered_df['text'].apply(lambda x: p.clean(x))
# # Create scatter plot
# fig = px.scatter(filtered_df, x='like_count',
#                  y='retweet_count',
#                  size='impressions',
#                  color='year',
#                  hover_name='tweet_text')
# # Update title and axis labels
# fig.update_layout(
#     title='Tweets talking about gender with most Likes and Retweets',
#     xaxis_title='Number of Likes',
#     yaxis_title='Number of Retweets'
# )
# fig.show()